In [1]:
import pysal as ps
import pandas as pd
import numpy as np
A widely used part of PySAL is its suite of tools for exploratory spatial data analysis. This notebook provides an overview of ways to conduct exploratory spatial analysis in Python.
First, let's read in some data:
In [6]:
data = ps.pdio.read_files(ps.examples.get_path('NAT.shp'))
W = ps.queen_from_shapefile(ps.examples.get_path('NAT.shp'))
W.transform = 'r'
In [5]:
data.head()
Out[5]:
In PySAL, commonly-used analysis methods are very easy to access. For example, if we were interested in examining the spatial dependence in HR90
we could quickly compute a Moran's $I$ statistic:
In [7]:
I_HR90 = ps.Moran(data.HR90.values, W)
In [16]:
I_HR90.I, I_HR90.p_sim
Out[16]:
Thus, the $I$ statistic is $.383$ for this data, and has a very small $p$ value.
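For intuition, Moran's $I$ is the ratio of the spatial cross-product of deviations from the mean to the total variance. A minimal numpy sketch of the formula on a hypothetical 4-node path graph (toy data, not the NAT shapefile):

```python
import numpy as np

# hypothetical 4-node path graph 0-1-2-3 with binary contiguity weights
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
x = np.array([1.0, 2.0, 3.0, 4.0])  # a smooth gradient: neighbors are similar

z = x - x.mean()
n, S0 = len(x), W.sum()
# I = (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2
I = (n / S0) * (z[:, None] * z[None, :] * W).sum() / (z ** 2).sum()
print(I)  # positive: neighboring values move together
```

Positive $I$ means neighboring deviations share the same sign; PySAL's `ps.Moran` computes the same quantity (plus inference) from the data and weights.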
We can visualize the distribution of simulated $I$ statistics using the stored collection of simulated statistics:
In [27]:
I_HR90.sim[0:5]
Out[27]:
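The pseudo $p$-value reported above is conventionally derived from this reference distribution: the share of permutations at least as extreme as the observed statistic, with a "+1" correction counting the observed value itself. A sketch using a synthetic stand-in for I_HR90.sim:

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic stand-in for I_HR90.sim: 999 statistics under spatial randomness
sim = rng.normal(loc=0.0, scale=0.02, size=999)
observed = 0.383  # the observed I from the notebook

# one-sided pseudo p-value with the standard +1 correction
p_sim = (np.sum(sim >= observed) + 1) / (len(sim) + 1)
print(p_sim)  # 0.001: none of the 999 permutations reach the observed value
```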
A simple way to visualize this distribution is to make a KDE plot (as we've done before), add a rug showing all of the simulated points, and draw a vertical line denoting the observed value of the statistic:
In [33]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [38]:
sns.kdeplot(I_HR90.sim, shade=True)
plt.vlines(I_HR90.sim, 0, 1)
plt.vlines(I_HR90.I, 0, 40, 'r')
Out[38]:
Instead, if our $I$ statistic were close to its expected value under the null, I_HR90.EI
, our plot might look like this:
In [37]:
sns.kdeplot(I_HR90.sim, shade=True)
plt.vlines(I_HR90.sim, 0, 1)
plt.vlines(I_HR90.EI+.01, 0, 40, 'r')
Out[37]:
In addition to univariate Moran's $I$, PySAL provides many other types of autocorrelation statistics:
In [18]:
c_HR90 = ps.Geary(data.HR90.values, W)
#ps.Gamma
#ps.Join_Counts
In [21]:
c_HR90.C, c_HR90.p_sim
Out[21]:
Since the statistic is below one with a significant $p$-value, it indicates the same pattern as the Moran's $I$ above: moderate, significant global positive spatial dependence in HR90
.
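Geary's $C$ contrasts squared differences between neighbors with the overall variance, so values below one indicate positive spatial association. A numpy sketch of the formula on the same kind of hypothetical 4-node path graph (toy data, not HR90):

```python
import numpy as np

# hypothetical 4-node path graph 0-1-2-3 with binary contiguity weights
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
x = np.array([1.0, 2.0, 3.0, 4.0])  # smooth gradient: neighbors similar

n, S0 = len(x), W.sum()
# C = (n - 1) * sum_ij w_ij (x_i - x_j)^2 / (2 * S0 * sum_i (x_i - xbar)^2)
num = (W * (x[:, None] - x[None, :]) ** 2).sum()
den = ((x - x.mean()) ** 2).sum()
C = (n - 1) * num / (2 * S0 * den)
print(C)  # below 1, consistent with positive spatial dependence
```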
In addition, we can compute a global bivariate Moran statistic, which relates a variable at each location to the spatial lag of a different variable:
In [23]:
bv_HRBLK = ps.Moran_BV(data.HR90.values, data.BLK90.values, W)
In [24]:
bv_HRBLK.I, bv_HRBLK.p_sim
Out[24]:
In addition to the Global autocorrelation statistics, PySAL has many local autocorrelation statistics. Let's compute a local Moran statistic for the same data shown above:
In [39]:
LMo_HR90 = ps.Moran_Local(data.HR90.values, W)
Now, instead of a single $I$ statistic, we have an array of local $I_i$ statistics, stored in the .Is
attribute, and p-values from the simulation are in p_sim
.
In [41]:
LMo_HR90.Is, LMo_HR90.p_sim
Out[41]:
We can adjust the number of permutations used to derive every pseudo-$p$ value by passing a different permutations
argument:
In [58]:
LMo_HR90 = ps.Moran_Local(data.HR90.values, W, permutations=9999)
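The number of permutations bounds the resolution of the pseudo $p$-values: the smallest attainable one-sided value is $1/(\text{permutations}+1)$, so more permutations allow finer-grained significance levels. A quick check:

```python
def min_pseudo_p(permutations):
    # smallest attainable one-sided pseudo p-value:
    # the observed statistic counts as one draw from the null
    return 1.0 / (permutations + 1)

print(min_pseudo_p(999))   # 0.001 with the default 999 permutations
print(min_pseudo_p(9999))  # 0.0001 with 9999 permutations
```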
In addition to the typical clustermap, a helpful visualization for LISA statistics is a Moran scatterplot with statistically significant LISA values highlighted.
This is very simple, if we use the same strategy we used before:
First, construct the spatial lag of the covariate:
In [59]:
Lag_HR90 = ps.lag_spatial(W, data.HR90.values)
HR90 = data.HR90.values
Then, we want to plot the statistically-significant LISA values in a different color than the others. To do this, first find all of the statistically significant LISAs. Since the $p$-values are in the same order as the $I_i$ statistics, we can do this in the following way
In [81]:
sigs = HR90[LMo_HR90.p_sim <= .001]
W_sigs = Lag_HR90[LMo_HR90.p_sim <= .001]
insigs = HR90[LMo_HR90.p_sim > .001]
W_insigs = Lag_HR90[LMo_HR90.p_sim > .001]
Then, since we have a lot of points, we can plot the points with a statistically insignificant LISA value lighter using the alpha
keyword. In addition, we would like to plot the statistically significant points in a dark red color.
In [95]:
b,a = np.polyfit(HR90, Lag_HR90, 1)
Matplotlib has a list of named colors and will interpret colors that are provided in hexadecimal strings:
In [100]:
plt.plot(sigs, W_sigs, '.', color='firebrick')
plt.plot(insigs, W_insigs, '.k', alpha=.2)
# dashed vertical line at the mean of HR90
plt.vlines(HR90.mean(), Lag_HR90.min(), Lag_HR90.max(), linestyle='--')
# dashed horizontal line at the mean of the spatial lag of HR90
plt.hlines(Lag_HR90.mean(), HR90.min(), HR90.max(), linestyle='--')
# red line of best fit using global I as slope
plt.plot(HR90, a + b*HR90, 'r')
plt.text(s='$I = %.3f$' % I_HR90.I, x=50, y=15, fontsize=18)
plt.title('Moran Scatterplot')
plt.ylabel('Spatial Lag of HR90')
plt.xlabel('HR90')
Out[100]:
We can also make a LISA map of the data.
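A LISA map colors each unit by its Moran scatterplot quadrant, masked by significance. Moran_Local stores the quadrant of each observation in its .q attribute (1=HH, 2=LH, 3=LL, 4=HL); a self-contained numpy sketch of that classification on synthetic stand-ins for the values, their lag, and the pseudo $p$-values:

```python
import numpy as np

rng = np.random.default_rng(42)
# synthetic stand-ins for HR90, its spatial lag, and LISA pseudo p-values
y = rng.normal(size=8)
lag = y + rng.normal(scale=0.5, size=8)
p_sim = rng.uniform(size=8)

# quadrant coding matching Moran_Local.q: 1=HH, 2=LH, 3=LL, 4=HL
zy, zl = y - y.mean(), lag - lag.mean()
quad = np.where(zy > 0, np.where(zl > 0, 1, 4), np.where(zl > 0, 2, 3))

# keep only significant locations for the map; 0 marks "not significant"
quad_sig = np.where(p_sim <= 0.05, quad, 0)
print(quad, quad_sig)
```

With real data, `quad_sig` would drive the color of each polygon when plotting the shapefile.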
Sometimes, to check for simple spatial heterogeneity, a fixed effects model can be estimated, provided the heterogeneity has known bounds. First, though, note that pandas
can build dummy variable matrices from categorical data very quickly:
In [296]:
pd.get_dummies(data.SOUTH) #dummies for south (already binary)
Out[296]:
This becomes handy when you have data that can be turned into a dummy variable but is not yet correctly encoded.
For example, the same call as above can make a dummy variable matrix to encode state fixed effects using the STATE_NAME
variable:
In [297]:
pd.get_dummies(data.STATE_NAME) #dummies for state by name
Out[297]:
For now, let's estimate a spatial regimes regression on the south/not-south division. To show how a regime effects plot may look, let's consider one covariate that is likely related and one that is very likely unrelated to $y$. That is, our formal specification for the regression will be:
$$ y = \beta_{0} + x_{1a}\beta_{1a} + x_{1b}\beta_{1b} + x_{2}\beta_{2} + \epsilon $$ where $x_{1a} = 1$ when an observation is not in the south and zero when it is. This is a simple spatial fixed effects setup, where each spatial regime is treated as having its own effect on the observations inside it.
In addition, we'll add an unrelated term to show how the regression effects visualization works:
In [325]:
y = data[['HR90']].values
x = data[['BLK90']].values
unrelated_effect = np.random.normal(0,100, size=y.shape[0]).reshape(y.shape)
X = np.hstack((x, unrelated_effect))
regimes = data.SOUTH.values.tolist()
In [326]:
regime_reg = ps.spreg.OLS_Regimes(y, X, regimes)
In [329]:
betas = regime_reg.betas
sebetas = np.sqrt(regime_reg.vm.diagonal())
In [330]:
sebetas
Out[330]:
In [335]:
plt.plot(betas,'ok')
plt.axis([-1,6,-1,8])
plt.hlines(0,-1,6, color='k', linestyle='--')
plt.errorbar([0,1,2,3,4,5], betas.flatten(), yerr=sebetas*3, fmt='o', ecolor='r')
plt.xticks([0, 1, 2, 3, 4, 5], ['Not South: Constant',
'Not South: BLK90',
'Not South: Unrelated',
'South: Constant',
'South: BLK90',
'South: Unrelated'], rotation=325)
plt.title('Regime Fixed Effects')
Out[335]: